Graph Search Algorithms



Graph Traversal Algorithms

Graph traversal refers to the process of visiting the vertices of a graph in a certain order, starting from one or more user-specified source nodes. This category of graph search algorithm only seeks to find a path between two nodes, without optimizing for the length of the final route. In applications where all edge weights in a graph are equal (e.g. 1), BFS and DFS outperform shortest path algorithms like Dijkstra’s.

Breadth-first Search (BFS)

BFS is an algorithm where the traversal starts at a specified node (the source or starting node) and continues through the graph layerwise, first exploring all of the current node’s neighbouring nodes (those which are directly connected to the current node). If a result is not found, the algorithm proceeds to search the next layer of neighbour nodes.

BREADTH-FIRST-SEARCH(source, destination) return a route
    frontier ← a FIFO initialized with the source node
    explored ← empty
    found ← False

    while frontier is not empty and found is False do
        node ← frontier.pop()
        add node to explored
        for child in node.expand() do
            if child is not in explored and child is not in frontier then
                if child is destination then
                    route ← child.route()
                    found ← True
                add child to frontier
    return route
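The pseudocode above can be sketched in plain Python on a toy adjacency-list graph. The dict below is purely illustrative (it is not the road network used later, and it does not use the `Node` class from `smart_mobility_utilities`):

```python
from collections import deque

# Toy undirected graph as an adjacency list (illustrative only)
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['A', 'D'],
    'D': ['B', 'C', 'E'],
    'E': ['D'],
}

def bfs_route(graph, source, destination):
    """Return the first route found, exploring the graph layer by layer."""
    frontier = deque([[source]])  # FIFO of partial routes
    explored = set()
    while frontier:
        route = frontier.popleft()  # oldest (shallowest) partial route first
        node = route[-1]
        if node == destination:
            return route
        explored.add(node)
        for child in graph[node]:
            # skip nodes already explored or already at the tip of a queued route
            if child not in explored and all(child != r[-1] for r in frontier):
                frontier.append(route + [child])
    return []

print(bfs_route(graph, 'A', 'E'))  # ['A', 'B', 'D', 'E']
```

Because the FIFO frontier always expands the shallowest partial route first, BFS finds the route with the fewest hops on this unweighted toy graph.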

Using BFS, let’s search for a route between The Equestrian Statue and the Bahen Centre. This example uses the same data as in Getting Started.

Note

This book uses the smart_mobility_utilities package for some operations, in order to simplify the process of visualizing graphs. You can find out more about downloading and installing the package here.

Let’s first find the largest connected component centered around our location, with a specified distance on each side. The reference point is the centre of the University of Toronto’s downtown campus.

import osmnx
reference = (43.661667, -79.395)
G = osmnx.graph_from_point(reference, dist=500, clean_periphery=True, simplify=True)

To plot the network, we will also need to highlight the starting and ending nodes. For the sake of simplicity, we will use the node id directly. For more information on finding the closest node to a given coordinate, refer back to the Getting Started section’s example.

highlighted = [1907446268, 1633421938]

# marking both the source and destination node

nc = ['red' if node in highlighted else '#336699' for node in G.nodes()]
ns = [50 if node in highlighted else 8 for node in G.nodes()]
fig, ax = osmnx.plot_graph(G, node_size=ns, node_color=nc, node_zorder=2)

Let’s visualize the above graph on an ipyleaflet map, using a helper function from the smart_mobility_utilities package.

from smart_mobility_utilities.viz import draw_map
draw_map(G,highlight=highlighted, force_leaflet=True)

Warning

For the purposes of this section, we use the force_leaflet option so that the map will be rendered by ipyleaflet. Normally, when there are more than 1,000 nodes in a graph, ipyleaflet performance is very slow. The visualization tools in smart_mobility_utilities will automatically switch to folium when there are more than 1,000 nodes, unless the force_leaflet flag is used. See the docs for smart_mobility_utilities for more information.

Currently, each node in the above graph is represented as a python dict with many attributes that are of no interest to us. This makes accessing certain properties of nodes overly complicated and verbose. To minimize this, we can use the Node class from smart_mobility_utilities.common to redefine the nodes, and only retain key information like parent, edge length from parent, and the node itself. You can view the structure of Node here.

from smart_mobility_utilities.common import Node, cost
from tqdm.notebook import tqdm
from collections import deque

# First convert the source and destination nodes to Node
origin = Node(graph=G, osmid=1907446268)
destination = Node(graph=G, osmid=1633421938)

# Setup a progress bar using tqdm, max is V+E
time_complexity = len(G.nodes)+len(G.edges)
bar = tqdm(total=time_complexity)

route = []
frontier = deque([origin])
explored = set()
found = False

while frontier and not found:
    node = frontier.popleft()
    explored.add(node)
    for child in node.expand():
        if child not in explored and child not in frontier:
            if child == destination:
                route = child.path()
                found = True
            frontier.append(child)
        bar.update(1)

bar.close()
print("Route: \n",route,"\n\n Cost:\n",cost(G,route))
Route: 
 [1907446268, 55808224, 55808416, 55808284, 1721866234, 389678268, 4953810915, 389678267, 24960090, 24960068, 1258698109, 389678145, 24960070, 24960073, 24960076, 24960080, 6028561924, 5098988924, 389678131, 6028562356, 854322047, 389677908, 24959560, 242413453, 749951161, 7311057931, 389678216, 389678215, 389678226, 1633421933, 1633421938] 

 Cost:
 1385.1159999999998

Note

We used collections.deque instead of a list because of the time complexity of the queue operations. Popping from the front of a Python list takes O(n) time (every remaining element shifts left), while collections.deque supports O(1) appends and pops at both ends, as it is implemented as a doubly-linked list of blocks.

Additionally, tqdm.notebook was used because we are rendering the progress in a Jupyter Notebook. For normal use, you can import directly from tqdm.
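The deque behaviour described in the note can be seen in a two-line sketch (the values are illustrative):

```python
from collections import deque

# A FIFO frontier: enqueue on the right, dequeue from the left.
# With a list, pop(0) would be O(n); deque.popleft() is O(1).
frontier = deque([1, 2, 3])
frontier.append(4)          # enqueue
first = frontier.popleft()  # dequeue in O(1)
print(first, list(frontier))  # 1 [2, 3, 4]
```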

As you can see, the BFS algorithm managed to find the route from source to destination in under a second, having traversed 91% of the graph to find it. Let’s plot the route.

fig, ax = osmnx.plot_graph_route(G, route)

Let’s also plot it on a map.

from smart_mobility_utilities.viz import draw_route
draw_route(G, route, force_leaflet=True)

Depth-first Search (DFS)

The DFS algorithm is a recursive algorithm that uses the idea of backtracking. It involves exhaustive searches of all the nodes by going as deep as possible into the graph. When it reaches the last layer with no result, it “backtracks” up a layer and continues the search.

DEPTH-FIRST-SEARCH(source, destination) return a route
    frontier ← a LIFO initialized with the source node
    explored ← empty
    found ← False

    while frontier is not empty and found is False do
        node ← frontier.pop()
        add node to explored
        for child in node.expand() do
            if child is not in explored and child is not in frontier then
                if child is destination then
                    route ← child.route()
                    found ← True
                add child to frontier
    return route

As you may have noticed, the only difference between DFS and BFS is in the way the frontier works. Rather than working down layer by layer (FIFO), DFS drills down to the deepest layer and works its way back toward the starting node (LIFO).
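That single FIFO-to-LIFO switch can be seen on a toy adjacency-list graph (illustrative only; the road-network implementation follows below). Only the pop direction changes compared to the BFS sketch:

```python
from collections import deque

# Toy undirected graph as an adjacency list (illustrative only)
graph = {
    'A': ['B', 'C'],
    'B': ['A', 'D'],
    'C': ['A', 'D'],
    'D': ['B', 'C', 'E'],
    'E': ['D'],
}

def dfs_route(graph, source, destination):
    """Return the first route found, drilling as deep as possible first."""
    frontier = deque([[source]])  # LIFO: pop from the right
    explored = set()
    while frontier:
        route = frontier.pop()  # newest (deepest) partial route first
        node = route[-1]
        if node == destination:
            return route
        explored.add(node)
        for child in graph[node]:
            # skip nodes already explored or already at the tip of a queued route
            if child not in explored and all(child != r[-1] for r in frontier):
                frontier.append(route + [child])
    return []

print(dfs_route(graph, 'A', 'E'))  # ['A', 'C', 'D', 'E']
```

Note that DFS returns a different route than BFS from the same graph: it commits to the most recently discovered branch instead of sweeping each layer.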

Let’s implement this algorithm with our previous example.

time_complexity = len(G.nodes) + len(G.edges) # Maximum is V + E
bar = tqdm(total=time_complexity) 

route = []
frontier = deque([origin])
explored = set()
found = False

while frontier and not found:
    node = frontier.pop()
    explored.add(node)
    for child in node.expand():
        if child not in explored and child not in frontier:
            if child == destination:
                route  = child.path()
                found = True
                continue
            frontier.append(child)
        bar.update(1)

bar.close()

print("Route: \n",route,"\n\n Cost:\n",cost(G,route))
Route: 
 [1907446268, 55808205, 55808194, 55808408, 55808414, 8711144452, 55808328, 55808437, 3210497979, 389677988, 1686556839, 389677984, 50885180, 36607322, 389677990, 389677993, 390545921, 60654129, 60654120, 50897854, 50897859, 389678001, 389678002, 2143434369, 390550470, 389678003, 390548860, 389678004, 771950946, 984911356, 728157228, 306721042, 389678005, 2143487625, 389678007, 5277943137, 2498969982, 389677902, 390545068, 390545043, 306725181, 390545044, 771931704, 775377001, 771950967, 8608123052, 771931728, 8608123055, 8608123068, 3554867351, 390545045, 390545047, 390545049, 390545050, 389678203, 390545078, 390545077, 24959544, 389678013, 389678205, 389678206, 389678207, 3996667046, 3996667045, 389678209, 389678210, 389678054, 389678175, 1432347915, 389678044, 389678043, 215726254, 3983181527, 389678211, 389678177, 24959555, 389678042, 389678184, 389678183, 389678216, 389678215, 389678214, 24959557, 389678218, 389678219, 773004741, 773004737, 5567060881, 5567060879, 1005007860, 1005007861, 24959565, 24959569, 389678241, 7311150229, 7311150221, 7311150222, 7311150218, 7311142107, 7311142109, 24959589, 390545033, 152659384, 389678238, 389678239, 389678240, 389678225, 389678245, 389678229, 729406374, 389678246, 29604723, 2143440970, 1458386384, 391188278, 389678247, 249989991, 391188296, 249990004, 389678248, 389678249, 6123553651, 6532307387, 6532307390, 969631968, 409731632, 4579468982, 969631975, 389678250, 394502545, 394502565, 394502563, 2809034239, 4380884143, 4380884142, 4376693531, 389678136, 389677925, 389678134, 2557539827, 389678133, 389677909, 749952029, 389677908, 389678222, 7311057936, 7311057937, 6028562355, 2557542523, 389677907, 239055729, 389678039, 389678040, 389677889, 389678220, 749951161, 393676412, 7311036242, 7311057933, 2557539817, 2557539816, 389678227, 1633422235, 1633421933, 1633421938] 

 Cost:
 6092.076000000003
draw_route(G,route, force_leaflet=True)

It is very evident that the paths generated by our DFS and BFS implementations are not the most direct route. This is because both DFS and BFS are algorithms that can find routes between two nodes, but make no guarantees that they will return the shortest path. Additionally, DFS generally returns “deeper” results as it traverses the entire depth of the graph and works backwards to find a solution.


Shortest Path Algorithms


Hill Climbing

The idea of the algorithm is quite simple:

Starting with a known (non-optimized) solution to a function that is to be optimized, the algorithm checks the neighbours of that solution, and chooses the neighbour that is “more” optimized. The process is repeated until no “better” solution can be found, at which point the algorithm terminates.

While the algorithm works relatively well with convex problems, functions with multiple local maxima will often result in an answer that is not the global maximum. It also performs poorly when there are plateaus (a local set of solutions that are all similarly optimized).

HILL-CLIMBING(source, destination) return a route
    current ← random route from source to destination
    neighbours ← children of current

    while min(neighbours) < current do
        current ← min(neighbours)
        neighbours ← children of current
    return current
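Before applying this loop to routes, it can be sketched on a simple one-dimensional minimization problem. The function and integer-step neighbourhood below are illustrative assumptions, not part of the routing code:

```python
def hill_climb(f, start, step=1):
    """Greedily move to the best neighbour until no neighbour improves f."""
    current = start
    while True:
        neighbours = [current - step, current + step]
        best = min(neighbours, key=f)
        if f(best) >= f(current):
            return current  # no better neighbour: a local minimum
        current = best

# f is convex with a single minimum at x = 3, so the climb finds it
f = lambda x: (x - 3) ** 2
print(hill_climb(f, start=10))  # 3
```

With a convex function like this one the local minimum is also the global one; with multiple minima, the answer depends entirely on the starting point.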


Here, we introduce a few new ideas.

First, we treat the route between two nodes as a function, the value of which is the distance between the two nodes. Second, we generate “children” of this function.

The function

We need to define a function \(f\) that is our target for optimization.

\(f(x)\) gives us the length of a route for some given route \(x \in Y\), where \(Y\) is the set of all possible routes between two specific nodes.

How do we generate \(x\)? We could just generate random permutations between the two nodes, filtering for permutations that are feasible, and optimize \(f\) over these random, sparse permutations.

However, this method is not reproducible (because the permutations change every run).

Instead, we make a deterministic policy that generates a number of \(x \in Y\) by successively “failing” nodes between the source and destination nodes. We then find the shortest path between the nodes before and after the “failed” nodes.

By failing the nodes in a deterministic fashion, we obtain a well-defined function and a neighbourhood of fixed size for each route, so we can “rigorously” conduct a local search.

To generate our initial route and children routes, we will use the smart_mobility_utilities package. You can see how these routes are generated by consulting the documentation for that package.

from smart_mobility_utilities.common import randomized_search, children_route
from itertools import islice

# Set the number of children to generate
num_children = 25

current = randomized_search(G, origin.osmid, destination.osmid)
print("Initial cost:",cost(G,current))

neighbours = [*islice(children_route(G, current), num_children)]
shortest = min(neighbours , key = lambda route : cost(G, route))
print("Initial min(children):",cost(G,shortest))

while cost(G, shortest) < cost(G, current):
    current = shortest
    neighbours = [*islice(children_route(G, current), num_children)]
    shortest = min(neighbours , key = lambda route : cost(G, route))
    print(f"Current cost:",cost(G, current),"|","min(children):",cost(G, shortest))

route = current
print("Final cost:",cost(G,route))
Initial cost: 1667.549
Initial min(children): 1359.4699999999996
Current cost: 1359.4699999999996 | min(children): 1330.581
Current cost: 1330.581 | min(children): 1324.9609999999998
Current cost: 1324.9609999999998 | min(children): 1330.581
Final cost: 1324.9609999999998
draw_route(G,route, force_leaflet=True)

While the implementation above is deterministic in nature, the initial route is still randomized. That means that it’s possible to get different results across runs.

Hill climbing will generally return some decent results as there are few local optimal points in the route function. However, with larger search spaces that will naturally have more local maxima and plateaus, it will get stuck fairly quickly.



Summary

The following is a comparison of the time complexities involved in the graph search algorithms above (with Dijkstra’s included for reference), using our graph:

| Algorithm     | Time Complexity    | Route Length | Time   |
|---------------|--------------------|--------------|--------|
| BFS           | \(O(V+E)\)         | 1,385.11 m   | 16 s   |
| DFS           | \(O(V+E)\)         | 6,092.08 m   | 15 s   |
| Dijkstra’s    | \(O(V+E\log V)\)   | 1,029.48 m   | 17 s   |
| Hill Climbing | \(O(\infty)\)      | varies       | varies |

While BFS may be faster in terms of processing time, it cannot guarantee that it will return the shortest path. DFS is not optimized for shortest path calculation, as it will tend to find “deeper” solutions (i.e. ones with longer lengths).
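Dijkstra’s algorithm, by contrast, guarantees the shortest path on weighted graphs by always expanding the cheapest known route first. The sketch below uses heapq on a toy weighted graph; the graph and its weights are illustrative assumptions, not our road network or the implementation used to produce the timings above:

```python
import heapq

# Toy weighted graph: node -> list of (neighbour, edge_length), illustrative only
graph = {
    'A': [('B', 2), ('C', 5)],
    'B': [('A', 2), ('C', 1), ('D', 4)],
    'C': [('A', 5), ('B', 1), ('D', 1)],
    'D': [('B', 4), ('C', 1)],
}

def dijkstra(graph, source, destination):
    """Return (cost, route) of the shortest path, expanding cheapest first."""
    heap = [(0, source, [source])]  # (cost so far, node, route so far)
    visited = set()
    while heap:
        cost, node, route = heapq.heappop(heap)  # cheapest frontier entry
        if node == destination:
            return cost, route
        if node in visited:
            continue
        visited.add(node)
        for child, length in graph[node]:
            if child not in visited:
                heapq.heappush(heap, (cost + length, child, route + [child]))
    return float('inf'), []

print(dijkstra(graph, 'A', 'D'))  # (4, ['A', 'B', 'C', 'D'])
```

Here the direct-looking hop A→C (length 5) loses to the three-edge detour A→B→C→D (total 4), which a BFS on hop count alone would never prefer.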